Data frames

Data frames are basically the common tables you know from Excel or from anywhere on the internet. A data.frame is usually the product of your long effort to preprocess and clean the data. To combine what we already know, data.frames are lists of vectors of the same length, with functionality to easily access rows of data across multiple vectors.

Data frame columns MUST have the same length; missing values are filled in with NA (or NaN in numeric columns). NULL cannot be used for this, because it has length zero and cannot be stored inside a vector. Similarly to the vector constraint, each column must hold a single variable type.
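As a small sketch of the equal-length rule (the names `ok` and `bad` are made up for illustration): a missing measurement is written as NA so the columns stay aligned, while columns of incompatible lengths are rejected.

```r
# A missing weight is recorded as NA so that both columns keep length 3
ok = data.frame(age = c(17, 23, 25), weight = c(65, NA, 74))
nrow(ok)

# Columns of length 3 and 2 cannot be combined; data.frame() raises an error,
# which is caught here so that the script keeps running
bad = tryCatch(
  data.frame(age = c(17, 23, 25), weight = c(65, 87)),
  error = function(e) "error"
)
bad

# Each column holds a single type
sapply(ok, class)
```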


In [32]:
set.seed(1)
age = sample(c(10:25), 25, replace = T)
gender = sample(c("male", "female"), 25, replace = T)
smoker = sample(c(T, F), 25, replace = T)
BMI = rnorm(25, 20, 2)

df = data.frame(age = age, gender = gender, smoker = smoker, BMI = BMI)

There are some simple functions to examine data.frames:


In [33]:
head(df)


  age gender smoker BMI
1  14 male   TRUE   22.4766082017068
2  15 male   FALSE  19.4413074362915
3  19 male   TRUE   23.5158061796214
4  24 female TRUE   21.1214921817761
5  13 male   TRUE   19.0944320548937
6  24 male   TRUE   18.3359134077643

In [34]:
summary(df)


      age           gender     smoker             BMI       
 Min.   :10.00   female:14   Mode :logical   Min.   :15.55  
 1st Qu.:14.00   male  :11   FALSE:8         1st Qu.:18.34  
 Median :19.00               TRUE :17        Median :19.65  
 Mean   :18.04               NA's :0         Mean   :19.84  
 3rd Qu.:22.00                               3rd Qu.:21.12  
 Max.   :25.00                               Max.   :24.88  

In [35]:
nrow(df)
ncol(df)


25
4
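A few more base R helpers are often useful for a first look. The sketch below uses a tiny made-up data frame called `small` so that it runs on its own.

```r
small = data.frame(age = c(17, 23, 25), smoker = c(TRUE, TRUE, FALSE))

dim(small)    # number of rows and columns at once
names(small)  # column names
str(small)    # compact overview: type and first values of each column
```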

Columns

Remember that each column is basically a vector. Therefore, if you select the vector, you can run any function on it. It is also important to know the different types of list subsetting. Single [n] selects the n-th element of a list WITH the name of the list, therefore it does not return a vector per se but a one-column data frame. Double [[n]], on the other hand, returns the element itself, here a plain vector.


In [36]:
df[3]
df[[3]]


   smoker
1  TRUE
2  FALSE
3  TRUE
4  TRUE
5  TRUE
6  TRUE
7  TRUE
8  FALSE
9  FALSE
10 TRUE
11 FALSE
12 TRUE
13 TRUE
14 TRUE
15 FALSE
16 TRUE
17 TRUE
18 FALSE
19 TRUE
20 FALSE
21 TRUE
22 FALSE
23 TRUE
24 TRUE
25 TRUE
  1. TRUE
  2. FALSE
  3. TRUE
  4. TRUE
  5. TRUE
  6. TRUE
  7. TRUE
  8. FALSE
  9. FALSE
  10. TRUE
  11. FALSE
  12. TRUE
  13. TRUE
  14. TRUE
  15. FALSE
  16. TRUE
  17. TRUE
  18. FALSE
  19. TRUE
  20. FALSE
  21. TRUE
  22. FALSE
  23. TRUE
  24. TRUE
  25. TRUE
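The difference between the two bracket forms is easiest to see by asking for the class of the result. This is a minimal sketch on a made-up data frame `small`.

```r
small = data.frame(age = c(17, 23, 25), smoker = c(TRUE, TRUE, FALSE))

class(small[2])     # "data.frame": still a (one-column) data frame
class(small[[2]])   # "logical": the bare vector
sum(small[[2]])     # vector functions work directly on it: 2 smokers
```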

Another way of selecting vectors is the list way of selecting elements by name, using the $ operator. This selection is effectively the same as selection with [[n]]. But remember that if you want to use the name of the column in brackets, you need to put a string there, [["smoker"]] (otherwise R will search for a variable called smoker).


In [37]:
df$smoker
df[["smoker"]]
df[["smoker"]] == df$smoker


  1. TRUE
  2. FALSE
  3. TRUE
  4. TRUE
  5. TRUE
  6. TRUE
  7. TRUE
  8. FALSE
  9. FALSE
  10. TRUE
  11. FALSE
  12. TRUE
  13. TRUE
  14. TRUE
  15. FALSE
  16. TRUE
  17. TRUE
  18. FALSE
  19. TRUE
  20. FALSE
  21. TRUE
  22. FALSE
  23. TRUE
  24. TRUE
  25. TRUE
  1. TRUE
  2. FALSE
  3. TRUE
  4. TRUE
  5. TRUE
  6. TRUE
  7. TRUE
  8. FALSE
  9. FALSE
  10. TRUE
  11. FALSE
  12. TRUE
  13. TRUE
  14. TRUE
  15. FALSE
  16. TRUE
  17. TRUE
  18. FALSE
  19. TRUE
  20. FALSE
  21. TRUE
  22. FALSE
  23. TRUE
  24. TRUE
  25. TRUE
  1. TRUE
  2. TRUE
  3. TRUE
  4. TRUE
  5. TRUE
  6. TRUE
  7. TRUE
  8. TRUE
  9. TRUE
  10. TRUE
  11. TRUE
  12. TRUE
  13. TRUE
  14. TRUE
  15. TRUE
  16. TRUE
  17. TRUE
  18. TRUE
  19. TRUE
  20. TRUE
  21. TRUE
  22. TRUE
  23. TRUE
  24. TRUE
  25. TRUE
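Instead of comparing element by element, identical() gives a single TRUE or FALSE. The sketch below also shows what happens when the column name is stored in a variable (here a hypothetical variable `col`): [[ ]] uses its value, while $ takes the name literally.

```r
small = data.frame(age = c(17, 23, 25), smoker = c(TRUE, TRUE, FALSE))

identical(small$smoker, small[["smoker"]])  # both forms return the same vector

col = "age"          # hypothetical variable holding a column name
small[[col]]         # the value of col is used as the column name
is.null(small$col)   # $ looks for a column literally named "col" and finds none
```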

Data frames also have their own way of selecting columns: the df[ROW, COLUMN] form. The column part accepts numbers as well as strings; leaving the row part empty selects all rows.


In [38]:
df[,3]
df[,"smoker"]


  1. TRUE
  2. FALSE
  3. TRUE
  4. TRUE
  5. TRUE
  6. TRUE
  7. TRUE
  8. FALSE
  9. FALSE
  10. TRUE
  11. FALSE
  12. TRUE
  13. TRUE
  14. TRUE
  15. FALSE
  16. TRUE
  17. TRUE
  18. FALSE
  19. TRUE
  20. FALSE
  21. TRUE
  22. FALSE
  23. TRUE
  24. TRUE
  25. TRUE
  1. TRUE
  2. FALSE
  3. TRUE
  4. TRUE
  5. TRUE
  6. TRUE
  7. TRUE
  8. FALSE
  9. FALSE
  10. TRUE
  11. FALSE
  12. TRUE
  13. TRUE
  14. TRUE
  15. FALSE
  16. TRUE
  17. TRUE
  18. FALSE
  19. TRUE
  20. FALSE
  21. TRUE
  22. FALSE
  23. TRUE
  24. TRUE
  25. TRUE

In [39]:
a = "BMI"
df[, a]


  1. 22.4766082017068
  2. 19.4413074362915
  3. 23.5158061796214
  4. 21.1214921817761
  5. 19.0944320548937
  6. 18.3359134077643
  7. 17.6668589058306
  8. 17.8688188392234
  9. 16.872435897858
  10. 22.3130739943004
  11. 21.6640942571448
  12. 19.5453426171505
  13. 20.5322747233442
  14. 19.2465945628327
  15. 24.8827292577892
  16. 18.4093217654893
  17. 19.8902450525768
  18. 20.5002826457083
  19. 21.2364865871325
  20. 19.6547529947083
  21. 15.5521994519801
  22. 17.4727712300588
  23. 20.7174577919427
  24. 19.9779090430687
  25. 18.1187016747628
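The same form also accepts several columns at once, and a row and a column together. A minimal sketch on a made-up data frame `small`; note that a single column is simplified to a vector unless `drop = FALSE` is given.

```r
small = data.frame(age = c(17, 23, 25),
                   smoker = c(TRUE, TRUE, FALSE),
                   weight = c(65, 87, 74))

small[, c("age", "weight")]    # two columns: the result is a data frame
small[, "age"]                 # one column: simplified to a vector
small[, "age", drop = FALSE]   # one column, kept as a data frame
small[2, "weight"]             # row 2 of column weight
```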

Subsetting

When we talk about subsetting data frames, we usually mean selecting rows while keeping all columns. If you only want to keep some columns, use the techniques presented above.

There are many ways to subset a data frame. The first thing to realise is that a data frame is a list of vectors, so we can use functionality similar to what lists have. The df[ROW, COLUMN] form will also come in handy. If in doubt, go back to the variables lecture about lists.

Basically, we have two major ways of subsetting: using common indexing or using functions.

Indexing

Indexing is possible with either logical vectors or row indices. Imagine the following data frame:

age smoker weight
17 yes 65
23 yes 87
25 no 74

In [40]:
small_df = data.frame(age = c(17, 23, 25), smoker = c(T, T, F), weight = c(65, 87, 74))

You can select the second row in either of these two ways: with a logical vector that is TRUE only at position 2, or with the row index 2.


In [41]:
small_df[c(F, T, F),]
small_df[2,]


  age smoker weight
2  23 TRUE   87
  age smoker weight
2  23 TRUE   87

Number indexing


In [42]:
age20smoker = which(df$age > 20 & df$smoker) # creating a vector of row indices
age20smoker
df[age20smoker,]


  1. 4
  2. 6
  3. 7
  4. 17
  5. 21
   age gender smoker BMI
4   24 female TRUE   21.1214921817761
6   24 male   TRUE   18.3359134077643
7   25 female TRUE   17.6668589058306
17  21 female TRUE   19.8902450525768
21  24 female TRUE   15.5521994519801
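The link between the two indexing styles is which(), which turns a logical vector into the positions of its TRUE values. A small sketch on the three-row data frame from above (`older` and `idx` are names chosen for illustration):

```r
small_df = data.frame(age = c(17, 23, 25),
                      smoker = c(TRUE, TRUE, FALSE),
                      weight = c(65, 87, 74))

older = small_df$age > 20   # logical vector, one value per row
idx = which(older)          # positions of the TRUE values
idx

small_df[older, ]           # rows selected by the logical vector
small_df[idx, ]             # the same rows selected by their indices
```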

Logical indexing

The logical vector style is much more common, but maybe a bit harder to wrap your head around. It keeps every element whose position holds TRUE in the logical vector.


In [43]:
numbers = 1:10
keep = rep(c(T,F), 5) # named keep so that the name of the built-in log() function is not reused
numbers
keep
numbers[keep]


  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  1. TRUE
  2. FALSE
  3. TRUE
  4. FALSE
  5. TRUE
  6. FALSE
  7. TRUE
  8. FALSE
  9. TRUE
  10. FALSE
  1. 1
  2. 3
  3. 5
  4. 7
  5. 9
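In practice the logical vector is rarely typed by hand; it usually comes from a comparison. A minimal sketch (the name `is_even` is made up for illustration):

```r
numbers = 1:10
is_even = numbers %% 2 == 0   # TRUE where the number is even

numbers[is_even]    # keeps 2 4 6 8 10
sum(is_even)        # TRUE counts as 1, so this is the number of matches: 5
numbers[!is_even]   # negation selects the rest: 1 3 5 7 9
```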

You can use a logical vector with one value per row to select rows of a data frame in the same way.


In [44]:
age20smoker = df$age > 20 & df$smoker # creating a logical vector
age20smoker
df[age20smoker,]


  1. FALSE
  2. FALSE
  3. FALSE
  4. TRUE
  5. FALSE
  6. TRUE
  7. TRUE
  8. FALSE
  9. FALSE
  10. FALSE
  11. FALSE
  12. FALSE
  13. FALSE
  14. FALSE
  15. FALSE
  16. FALSE
  17. TRUE
  18. FALSE
  19. FALSE
  20. FALSE
  21. TRUE
  22. FALSE
  23. FALSE
  24. FALSE
  25. FALSE
   age gender smoker BMI
4   24 female TRUE   21.1214921817761
6   24 male   TRUE   18.3359134077643
7   25 female TRUE   17.6668589058306
17  21 female TRUE   19.8902450525768
21  24 female TRUE   15.5521994519801

In [45]:
select_last = c(rep(F, 24), T)
select_last
df[select_last,]


  1. FALSE
  2. FALSE
  3. FALSE
  4. FALSE
  5. FALSE
  6. FALSE
  7. FALSE
  8. FALSE
  9. FALSE
  10. FALSE
  11. FALSE
  12. FALSE
  13. FALSE
  14. FALSE
  15. FALSE
  16. FALSE
  17. FALSE
  18. FALSE
  19. FALSE
  20. FALSE
  21. FALSE
  22. FALSE
  23. FALSE
  24. FALSE
  25. TRUE
   age gender smoker BMI
25  14 female TRUE   18.1187016747628

In [46]:
df_smokers = df[df$smoker,] # keep only the rows where smoker is TRUE
df_smokers$BMI
mean(df_smokers$BMI)


  1. 22.4766082017068
  2. 23.5158061796214
  3. 21.1214921817761
  4. 19.0944320548937
  5. 18.3359134077643
  6. 17.6668589058306
  7. 22.3130739943004
  8. 19.5453426171505
  9. 20.5322747233442
  10. 19.2465945628327
  11. 18.4093217654893
  12. 19.8902450525768
  13. 21.2364865871325
  14. 15.5521994519801
  15. 20.7174577919427
  16. 19.9779090430687
  17. 18.1187016747628
19.8676893056573

In [47]:
zeny = gender == "female" # "zeny" is Czech for "women"
age22 = age > 22
zeny22 = zeny & age22 # women older than 22
df[zeny22,]


   age gender smoker BMI
4   24 female TRUE   21.1214921817761
7   25 female TRUE   17.6668589058306
18  25 female FALSE  20.5002826457083
21  24 female TRUE   15.5521994519801

In [48]:
# BMI values of male non-smokers younger than 24; max() of this vector gives the maximal BMI
males = gender == "male"
age24 = age < 24
nonsmoker = !smoker
male24nonsmoker = males & age24 & nonsmoker
df[male24nonsmoker,]$BMI


  1. 19.4413074362915
  2. 17.8688188392234
  3. 16.872435897858
  4. 24.8827292577892
  5. 17.4727712300588
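To finish the task named in the comment, the selected BMI values still have to be passed to max(). The sketch below does this on a small made-up data frame `people`, so the numbers differ from the ones above; the conditions are taken from the data frame's own columns rather than from separate vectors.

```r
people = data.frame(
  gender = c("male", "male", "female", "male", "male"),
  age    = c(19, 30, 22, 21, 22),
  smoker = c(FALSE, FALSE, FALSE, TRUE, FALSE),
  BMI    = c(19.4, 24.9, 21.0, 17.5, 23.1)
)

# male AND younger than 24 AND non-smoker
keep = people$gender == "male" & people$age < 24 & !people$smoker

people[keep, "BMI"]        # BMI values of the matching rows (rows 1 and 5)
max(people[keep, "BMI"])   # the largest of them: 23.1
```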